%reset -f # Reset variables (-f skips the confirmation prompt)
# Data Science Packages
import numpy as np
import pandas as pd
# Visualization Packages
import plotly.express as px
# Statistics Packages
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# For exporting plotly graphs to HTML
import plotly.io as pio
pio.renderers.default='notebook'
League of Legends is a complex MOBA game that consists of two five-player teams, where each team tries to destroy the other team’s nexus to win the game. Ranked solo queue is the competitive version of the game in which one player joins a game with nine other random players of a similar rank to compete against one another. There are various intricacies within the game that can sway the odds in an individual’s favor, such as slaying the first dragon, acquiring more map objectives, or getting first blood. Oftentimes players, especially those who are lower ranked, are overwhelmed by the complexity of the game. Understanding which key objectives are the most influential to winning a game will allow players to focus on acquiring these larger, tangible objectives over the minute intricacies within the game.

Hypothesis: Acquiring Epic Monsters (Baron Nashor and Elder Dragon) and inhibitors will have a statistically significantly higher impact on winning a ranked match than other variables.
It is expected that epic monsters will have a significant impact on a game. Baron Nashor provides the team with an aura that increases their minions' strength, and Elder Dragon provides an aura that increases the team's total damage to the enemy team. Destroying an enemy inhibitor gives your team an extra, more powerful minion that can distract the enemy team while more turrets are taken. Epic monsters and inhibitors are taken in most games, so the data should be populated enough to show statistical significance. Other factors such as turret destruction, wards placed, and non-epic monsters provide only marginal benefit to your team, so they should have less impact on winning a game overall.
Game data can be overly complicated, so certain decisions were consciously made to ensure a more robust analysis. Only non-temporal numerical data is being considered: each objective is counted per game rather than at timestamps within the game. All of the data is specific to one rank (Platinum), which consists of the top 10% of ranked players in the game. All of the game data is also from a single patch, or 'version', of the game for consistency. All of the game data is from the Blue Team, and map symmetry will be assumed. Which champions the players chose, or which champions they banned, will not be considered: there are over 140 different characters in total, and any combination of only 10 appears in each game. Choosing which champion is best is outside the scope of this project, and only overall map objectives are being considered.
Riot Games is the developer of League of Legends. The Riot Games API, where game data can be pulled from, is quite complicated, so I decided to use data already extracted from the API by a data scientist on Kaggle. An advantage of this method is that data extraction is simple and includes all relevant data needed for the analysis. A disadvantage is that I am not able to specifically control which data is pulled from the API. Were the analysis more complicated, additional information would need to be pulled specifically from the API.
In fact, the only challenge I ran into during the process of collecting my data was using the Riot Games API. Learning how to pull information from it is outside the scope of this analysis, so it was beneficial to turn to Kaggle for data collection, since people with more knowledge of the API can better pull the necessary data.
The data extraction and preparation process can be seen in the code below. The code is commented to explain what each step is accomplishing.
Pandas is used to read in the data, store tabular data into matrices, and perform many data science methods for exploratory analysis and data manipulation.
Sklearn is used to split the data into training and testing sets which is important in validation to determine the model's accuracy. This is advantageous to use over manual splitting because of the 'random state' input. This allows the data to be randomly split, and is easily iterable to ensure that the data is not biased by the order. A disadvantage to using this tool is that the indices can be complicated, and one must take extra care to ensure each row is maintained properly.
Statsmodels is used to add a constant (intercept) term, as is necessary with logistic regression.
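The index caveat mentioned above can be sketched on a tiny hypothetical frame (not the project data): train_test_split preserves the original index labels on every returned piece, which is how one can verify that features and labels stay aligned after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny hypothetical frame standing in for the game data
df = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

# random_state makes the random shuffle repeatable across runs
X_train, X_test, y_train, y_test = train_test_split(
    df[['x']], df['y'], test_size=0.3, random_state=42)

# The split keeps the original index labels, so rows stay aligned
print((X_train.index == y_train.index).all(), len(X_train), len(X_test))  # True 7 3
```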
# Read in the data from csv, and import it to a pandas dataframe
df = pd.read_csv('lol_ranked_games.csv') # Ref (1)
print(df.isnull().any().any()) # Double check there are no null values
# The data consists of time 'frames' within each game. Since time is not considered in the analysis,
# only consider the last row of each game by using the groupby() method to group by the gameID,
# use the .last() method to take the final time frame of the game, and reset the index for consistency
df_clean = df.groupby('gameId').last().reset_index() # Remove the temporal aspect from the data and reset the index
df_describe = df_clean.describe() # Describe the statistics to look for outliers
False
The describe() method can be used to assess outliers or incorrect data by looking at interquartile ranges. The data seems very consistent aside from two key aspects. The first is that the author mislabeled the nexus turrets and base turrets. As seen in the snapshot below, there should only be 2 nexus turrets per team, and there should be 3 base turrets (one per lane). Nexus turrets should be dropped from the analysis since they MUST be destroyed before the nexus, so the winning team will always have destroyed both. The labels need to be corrected for proper analysis.


The second aspect is wards placed. As seen in the describe() output, there are potential outliers in wards placed. This is most likely because players sit in base and spam-place wards, which is common player behavior when playing from behind. Logistic regression doesn't handle outliers well, so these will be checked with a histogram and then imputed as necessary.
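The interquartile-range outlier check that describe() enables can be sketched as follows, using the classic Tukey fence on synthetic values (the real wardsPlaced column is not reproduced here):

```python
import pandas as pd

# Synthetic stand-in for the wardsPlaced column (hypothetical values, not the real data)
wards = pd.Series([10, 12, 15, 14, 13, 500, 11, 16, 450, 12])

# Classic Tukey fence: anything beyond Q3 + 1.5*IQR is flagged as an outlier
q1, q3 = wards.quantile(0.25), wards.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = wards[wards > upper_fence]
print(len(outliers))  # the two extreme 'ward spam' games
```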

fig = px.histogram(df_clean['wardsPlaced'])
fig.update_layout(title='Wards Placed Histogram', xaxis_title='Wards Placed')
fig.show()
print('The number of wards placed greater than 400 is ' + str(sum(df_clean['wardsPlaced']>400)))
The number of wards placed greater than 400 is 248
# Drop the turrets mislabeled as 'base' turrets (actually nexus turrets), since they are irrelevant to the analysis
df_clean = df_clean.drop(columns=['lostTopBaseTurret', 'lostMidBaseTurret', 'lostBotBaseTurret',
                                  'destroyedTopBaseTurret', 'destroyedMidBaseTurret', 'destroyedBotBaseTurret'])
# Rename the nexus turret variables to base turret to correct the input error
df_clean = df_clean.rename(columns={'lostTopNexusTurret': 'lostTopBaseTurret',
                                    'lostMidNexusTurret': 'lostMidBaseTurret',
                                    'lostBotNexusTurret': 'lostBotBaseTurret',
                                    'destroyedTopNexusTurret': 'destroyedTopBaseTurret',
                                    'destroyedMidNexusTurret': 'destroyedMidBaseTurret',
                                    'destroyedBotNexusTurret': 'destroyedBotBaseTurret'})
# Impute games where players placed more than 400 wards with the 75th percentile value
# The 75th percentile is used here since 248 instances <<< 24912 total games, so more complex
# imputation is not necessary
q75 = np.percentile(df_clean['wardsPlaced'], 75)
df_clean.loc[df_clean['wardsPlaced'] > 400, 'wardsPlaced'] = q75 # .loc avoids chained-assignment warnings
fig = px.histogram(df_clean['wardsPlaced'])
fig.update_layout(title='Wards Placed (Imputed with 75th percentile) Histogram', xaxis_title='Wards Placed')
fig.show()
corr = df_clean.corr() # Correlation Matrix
corr = corr[corr>0.8] # Look for highly correlated variables in the data set
# How to deal with correlated variables:
# frame correlates with gameDuration --> drop frame
# goldDiff, expDiff, champLevelDiff all correlate with one another and with the dependent variable hasWon --> drop all 3
# kills correlates with assists --> drop assists (dropped below with the other out-of-scope columns)
# Drop highly correlated variables since they can interfere with Logistic Regression
df_clean = df_clean.drop(columns=['frame', 'goldDiff', 'expDiff', 'champLevelDiff'])
# Drop irrelevant data
# Kills, assists, and deaths are not key map objectives and are outside the scope of this work
# Wards Placed, wards lost, and wards destroyed are also not key map objectives and are therefore outside scope of this work
# Previous ward imputation work was kept in case the scope of this project ever changes
df_clean = df_clean.drop(columns=['gameId', 'assists', 'kills', 'deaths', 'wardsPlaced', 'wardsLost', 'wardsDestroyed'])
# Drop hasWon and assign to new variable y for train_test_split
y = df_clean['hasWon']
df_clean.drop(columns='hasWon', inplace=True)
# Split training & test data (a 30% test split is standard for >10000 observations; random_state for repeatability)
X_train, X_test, y_train, y_test = train_test_split(df_clean, y, test_size=0.3, random_state=42)
X_train = sm.add_constant(X_train) # Add constant for Logistic Regression model
X_test = sm.add_constant(X_test) # Add constant for Logistic Regression model
# The data is now prepared for Logistic Regression analysis --> (X_train, X_test, y_train, y_test)
Logistic regression will be used to classify a 'Victory' or 'Defeat' for this data set. Logistic regression is best used when the dependent variable to be predicted is a binary, categorical response (yes/no, or victory/defeat), as it fits the probabilities to a sigmoid curve, yielding either a zero or a one as the classifying value. Logistic regression works with data that is continuous, discrete nominal, or discrete ordinal. Given that all the data in this set is numerically continuous or discrete, there will be no issues using the project data as is. Additionally, the number of observations is much greater than the number of features, which means that overfitting is not anticipated to be an issue.
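The sigmoid mapping described above can be sketched as follows; the coefficient and feature values here are purely illustrative, not the fitted model:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: a constant term plus two binary objective indicators
beta = np.array([0.2, 1.5, -0.9])
x = np.array([1.0, 1.0, 0.0])   # objective 1 taken, objective 2 not
p = sigmoid(beta @ x)           # probability of 'victory'
print(p >= 0.5)                 # classify as a win when p >= 0.5 --> True
```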
Despite there being significantly more observations than variables, logistic regression is a linear model and can be affected by the curse of dimensionality: an issue with certain statistical models where adding dimensions leads to overfitting and less meaningful results. Before implementing a logistic regression model for this data set, one must perform feature reduction so that only the most influential features are input to the model. Fortunately, a widely accepted methodology for reducing the number of features exists: L1 (Lasso) regularization. This technique shrinks the coefficients of irrelevant variables, thereby extracting only the most influential data. It is accomplished by adding a penalty term to the cost function, scaled by lambda; as lambda increases, the coefficients of more variables are driven to zero. See the below L1 regularization equation.
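Written out in its standard textbook form, the L1-regularized logistic regression cost function is the negative log-likelihood plus the lambda-scaled sum of absolute coefficients:

$$
J(\beta) = -\sum_{i=1}^{n}\Big[y_i \ln(p_i) + (1-y_i)\ln(1-p_i)\Big] + \lambda \sum_{j=1}^{k} \lvert\beta_j\rvert,
\qquad p_i = \frac{1}{1+e^{-x_i^\top \beta}}
$$

As $\lambda$ grows, the penalty term dominates and more coefficients $\beta_j$ are driven exactly to zero.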

It is important to note that L1 (Lasso) regularization is used over step-wise feature selection due to its robustness. Step-wise feature selection has multiple issues, the most severe being bias and overfitting (2).
# Finding Feature Importances using L1 (Lasso) Regularization
# Sklearn's LogisticRegression model is used since it has an L1 penalty option
# C is the inverse of the lambda term seen above (C = 1/lambda), so the smaller the value, the more terms are eliminated
# The optimization solver is liblinear, which is robust in the face of large data sets (3)
# Use varying values of C to determine an appropriate value based on accuracy
c = [3e-4, 4e-4, 5e-4, 1e-3, 1e-2]
df_c = pd.DataFrame() # Empty dataframe to collect results
# Loop through the values of c to determine the respective accuracy
for C in c:
    sel_ = LogisticRegression(C=C, penalty='l1', solver='liblinear') # Create Logistic Regression object with l1 penalty
    sel_.fit(X_train, y_train) # Fit the model using the X_train and y_train data
    acc = sel_.score(X_test, y_test) # Determine accuracy of the model
    feature_importances_lr = X_train.columns[SelectFromModel(sel_, prefit=True).get_support()] # Features retained given C (prefit=True since the model is already fit)
    df_c = pd.concat([df_c, pd.DataFrame({'Accuracy': [acc],
                                          'Number of Features': [len(feature_importances_lr)],
                                          'C': [C]})]) # Record the accuracy score, number of retained features, and C value
fig = px.line(df_c, x='C', y='Accuracy') # Create a line plot figure for C value vs Accuracy
# Loop through every row of df_c so an annotation with the resulting feature count is added to each point
for i in range(len(df_c)):
    if i == 3:
        fig.add_annotation(x=df_c['C'].iloc[i], y=df_c['Accuracy'].iloc[i],
                           text='Features: ' + str(df_c['Number of Features'].iloc[i]),
                           showarrow=True,
                           arrowhead=1,
                           yshift=10) # Shift this label up so it does not overlap its neighbor
    else:
        fig.add_annotation(x=df_c['C'].iloc[i], y=df_c['Accuracy'].iloc[i],
                           text='Features: ' + str(df_c['Number of Features'].iloc[i]),
                           showarrow=True,
                           arrowhead=1)
fig.update_layout(title='Accuracy vs L1 Penalty (C)') # Add a title
fig.show() # Show the figure
From the above figure, there is an elbow where the accuracy levels off at C=5e-4. This value retains seven features that the L1 regularization deems important. Fewer features should not be used because of the severe reduction in accuracy; more features should not be included since they would only marginally increase accuracy and could result in overfitting. In this case, ~90% accuracy captures most of the model, and an extra 4% is not necessary to explain the overarching trends.
Now this code is run again to store only the most important features.
# Extracting important features with the found optimized C value
sel_ = LogisticRegression(C=5e-4, penalty='l1', solver='liblinear') # Create Logistic Regression object with l1 penalty
sel_.fit(X_train, y_train) # Fit the model using the X_train and y_train data
feature_importances_lr = X_train.columns[SelectFromModel(sel_, prefit=True).get_support()] # Features retained at the chosen C
print(feature_importances_lr) # Print resulting important features
Index(['lostBaronNashor', 'destroyedBotInhibitor', 'lostBotInhibitor',
'destroyedBotBaseTurret', 'lostBotBaseTurret',
'destroyedBotInnerTurret', 'lostBotInnerTurret'],
dtype='object')
Random forest is a machine learning algorithm based on bootstrapped decision trees. While random forest is not a linear model like logistic regression, its feature importances can be examined to gauge whether the feature selection from L1 regularization is on the right track. This is advantageous because Sklearn's logistic regression model does not output any justification for the variables selected. The disadvantage is that the random forest algorithm is based on completely different, non-linear mathematics, so this cross-check could add more confusion than help if the analyst does not have a good sense of the data they are working with. In this case it is fine, since the data is not overly complex and should be linearly related.
# Random Forest Feature Selection
rf = RandomForestClassifier() # Create Random Forest Object
rf.fit(X_train, y_train) # Fit Random Forest to the training data
y_predict = rf.predict(X_test) # Predict whether a team will win or not
print(accuracy_score(y_test, y_predict)) # Print accuracy score
feature_importances_rf = pd.DataFrame(rf.feature_importances_,
                                      index=X_train.columns,
                                      columns=['importance']).sort_values('importance', ascending=False) # Get feature importances from Random Forest
print(feature_importances_rf[0:len(feature_importances_lr)]) # Print the same number of features as chosen from L1 regularization
0.953973775755954
importance
destroyedBotBaseTurret 0.109313
lostBotInnerTurret 0.095116
lostBotBaseTurret 0.089524
destroyedBotInnerTurret 0.083337
lostBaronNashor 0.078422
destroyedBotInhibitor 0.070949
lostBotInhibitor 0.058010
Interestingly, random forest yields the same important features as the L1 (Lasso) regularization. This implies that the majority of influence comes from these seven features. Now the logistic regression model must be calculated using Statsmodels' (SM) logistic regression function 'Logit'. While Sklearn's model is advantageous since it can implement an L1 penalty (which SM cannot), it has the disadvantage that its results are not easily interpretable. SM's logistic regression function will be used with the selected features due to its summary capability, which allows easy inspection of the p-values and coefficients pertinent to addressing the hypothesis.
df_reduced = df_clean[feature_importances_lr] # Only use features from earlier importance selection
# Split training & test data (a 30% test split is standard for >10000 observations; random_state for repeatability)
X_train, X_test, y_train, y_test = train_test_split(df_reduced, y, test_size=0.3, random_state=42)
X_train = sm.add_constant(X_train) # Add constant for model
X_test = sm.add_constant(X_test) # Add constant for model
model_original = sm.Logit(y_train, X_train).fit() # Fit Logistic Regression model from Statsmodels
print(model_original.summary()) # Statistical Summary
y_hat = np.round(model_original.predict(X_test)) # Round probabilities for confusion matrix
print('Confusion Matrix \n', confusion_matrix(y_test, y_hat)) # Print confusion matrix to assess any biases
print('Accuracy Score:', accuracy_score(y_test, y_hat)) # Double check accuracy to Sklearn's Logistic Regression model
Optimization terminated successfully.
Current function value: 0.235927
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: hasWon No. Observations: 17438
Model: Logit Df Residuals: 17430
Method: MLE Df Model: 7
Date: Sun, 13 Nov 2022 Pseudo R-squ.: 0.6596
Time: 14:28:26 Log-Likelihood: -4114.1
converged: True LL-Null: -12087.
Covariance Type: nonrobust LLR p-value: 0.000
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
const 0.2269 0.042 5.437 0.000 0.145 0.309
lostBaronNashor -1.8359 0.057 -32.042 0.000 -1.948 -1.724
destroyedBotInhibitor 1.4924 0.073 20.396 0.000 1.349 1.636
lostBotInhibitor -0.9134 0.079 -11.599 0.000 -1.068 -0.759
destroyedBotBaseTurret 1.8159 0.084 21.610 0.000 1.651 1.981
lostBotBaseTurret -1.3285 0.085 -15.626 0.000 -1.495 -1.162
destroyedBotInnerTurret 1.9666 0.066 30.013 0.000 1.838 2.095
lostBotInnerTurret -1.4112 0.064 -22.196 0.000 -1.536 -1.287
===========================================================================================
Confusion Matrix
[[3213 451]
[ 235 3575]]
Accuracy Score: 0.9082151458389082
A logistic regression model with ~90.8% validation accuracy was successfully implemented using the key important map-objective features from the blue-team Platinum solo-queue data set. From Statsmodels' logistic regression summary, one can see that each of the final features is statistically significant, each with a p-value < 0.05.
The seven key features in the data set are losing Baron Nashor, destroying the bottom inhibitor, losing the bottom inhibitor, destroying the bottom base turret, losing the bottom base turret, destroying the bottom inner turret, and losing the bottom inner turret. Each turret variable is binary (lost or maintained). Losing Baron Nashor or inhibitors can happen more than once, but typically does not occur substantially more than once in a game (see statistics). Each of these variables has around the same coefficient magnitude because of their similar occurrence rates; given the lack of extreme differences, no single variable impacts the dependent variable dramatically more than the others.
Destroying the enemy bottom inhibitor and losing the friendly one, and likewise destroying and losing the bottom inner turret, have similar coefficient magnitudes because they are of roughly equal importance. Losing Baron Nashor and destroying the bottom base turret have roughly equal but opposite effects on the model.
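One way to make these coefficient magnitudes concrete is to exponentiate them into odds ratios. The values below are copied from the printed Logit summary, and the interpretation follows the usual logistic-regression reading:

```python
import numpy as np

# Selected coefficients copied from the printed Logit summary above
coefs = {
    'lostBaronNashor': -1.8359,
    'destroyedBotInhibitor': 1.4924,
    'destroyedBotInnerTurret': 1.9666,
}

# exp(coef) is the multiplicative change in the odds of winning per unit
# increase in that variable, holding the other variables fixed
odds = {name: float(np.exp(b)) for name, b in coefs.items()}
for name, ratio in odds.items():
    print(f'{name}: x{ratio:.2f} odds of winning')
```

For example, destroying the bottom inner turret multiplies the estimated odds of winning by roughly 7, while each lost Baron cuts the odds to about a sixth.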
It was hypothesized that the Epic Monsters and any lane's inhibitor would have the greatest effect on predicting victory or defeat. Surprisingly, it seems that the greatest effect on match outcome actually runs through the bottom lane: securing or losing bottom turrets and the bottom inhibitor affects winning or losing more than the Epic Monsters do.
After assessing the results, a key limitation of this analysis is that only data from the blue team is represented. As one can see, the resulting model is not as symmetrical as one might expect. For example, destroying Baron Nashor should be included if losing Baron Nashor is; likewise, losing the base turret should be included if destroying it is. If red team data were included, one could determine whether the asymmetrical nature of the map influences certain objective behavior more on one team than the other. For the purposes of this analysis, symmetry was assumed.
Bottom lane objectives outweigh those of the other lanes. Additionally, slaying or losing dragons and rift heralds does not statistically significantly impact the outcome of the game. For this reason, players will increase their odds of winning if they direct their focus primarily towards securing and defending bottom-lane turret and inhibitor objectives, and towards preventing Baron Nashor from being taken.
If a player is in the mid or top role, they should roam and save teleport for securing or defending bottom objectives. These two roles empirically have low influence on the outcome of a game, and a player should consider learning other positions if they want to affect it consistently.
If a player is in the jungle role, they should gank bottom more than the other lanes. They should also avoid risky plays in the bottom lane, as this lane is more critical than the others.
If a player is in the bottom or support role, their focus should always be on capturing their lane objectives. Any unnecessary fighting and recalls should be avoided, and lane pressure / map vision should always be maintained to secure objectives.
Further analysis needs to be conducted in order to determine whether there is a statistically significant difference between the blue and red teams' variable populations. If there is, an additional study would need to be performed analyzing the red team's key map-objective influences on victory. These two studies could then be compared, and a course of action could be suggested to give the player the best advice to win based on their team color.
An additional study that should be conducted is understanding the temporal domain's effect on the data. Taking certain map objectives too soon, or too late, could impact the outcome of the game. Destroying an inhibitor, for example, provides an extra-strong minion to help the team, but at a cost: if a player on the team whose inhibitor was destroyed kills the super minion, they are rewarded with extra gold. If an inhibitor is taken too soon for the team to use the super minion effectively, the defending team can turn this to their advantage by acquiring extra gold. Studying the temporal effects of taking map objectives could give further insight by informing the player of the best time to take each objective.
(1) League of Legends Soloq Ranked Games, (n.d.), Retrieved October 15th, 2022, from https://www.kaggle.com/datasets/bobbyscience/league-of-legends-soloq-ranked-games
(2) Sribney, Bill, What are some of the problems with stepwise regression?, (n.d.), Retrieved October 16th, 2022, from https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/
(3) Hale, Jeff, Don’t Sweat the Solver Stuff, (September 26, 2019), Retrieved October 17th, 2022, from https://towardsdatascience.com/dont-sweat-the-solver-stuff-aea7cddc3451